[1.x] Backport Fix for duplicate subgraph inputs/outputs (#16131) #19112
Conversation
* fix for duplicate inputs
* fixed error
* fixed whitespace
* Remove duplicate outputs from subgraphs
* changed subgraph to create map of outputs
* added static_cast
* changed map<int,v> to vector
* sanity fix
* sanity2
* updated backends with new connectSubgraphOutputs API
* fixed map creation logic
* added updates for reattach function
* creating node only if it is not an input to subgraph
* creating object based on var_name only
* updating ConnectSubgraphOutputs for mkldnn_elemwisemul_post_quantize_property.h
* add debug prints to debug error in CI
* remove prints
* added prints to debug in the CI
* revert changes
* reverted changes
* deduplicated inputs to subgraph
* deduplicated subgraph inputs
* simplified inputs
* cleaned up
* deduplicate outputs
* cleaned up
* added deduplication to subgraph node outputs
* fixed prev compare
* fixed issue with inputs and added test
* fixed whitespace, removed prints

Co-authored-by: Sam Skalicky <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Manu Seth <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Hey @samskalicky , Thanks for submitting the PR
CI supported jobs: [miscellaneous, windows-cpu, unix-cpu, edge, unix-gpu, website, windows-gpu, centos-gpu, centos-cpu, sanity, clang]
@ZhennanQin @pengzhao-intel I'm debugging a test failure with this PR:
https://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/unix-cpu/branches/PR-19112/runs/1/nodes/296/steps/781/log/?start=0 But now I'm getting a segfault:
I debugged further and found the segfault was coming from here:
There seem to be a lot of hard-coded magic numbers in the subgraph/mkldnn code. @HahTK I don't think we're going to be able to get this PR merged in time for the v1.8 code freeze without some help. Worst case, we'll just get this in later on master in 2.0.
@HahTK I've added support for specifying
std::vector<nnvm::NodeEntry> *unique_orig_entries,
std::vector<nnvm::NodeEntry*> *unique_input_entries,
const bool skip_var = false,
const bool dedup = false) {
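For illustration, the input-deduplication idea behind these parameters (the names `unique_orig_entries` and `unique_input_entries` come from the C++ signature above; this Python rendering is a hypothetical sketch, not MXNet code) amounts to keeping the first occurrence of each entry and remembering a remapping from old input slots to deduplicated ones:

```python
# Hypothetical sketch of subgraph-input deduplication. Entries are
# modeled as (node_name, output_index) tuples instead of nnvm::NodeEntry.

def dedup_inputs(entries, dedup=True):
    """Return the unique input entries plus a map from each original
    input slot to its slot in the deduplicated list."""
    if not dedup:
        # dedup=False preserves the old behavior: inputs pass through as-is.
        return list(entries), list(range(len(entries)))
    unique = []    # analogous to unique_orig_entries
    index_of = {}  # entry -> slot in `unique`
    remap = []     # original input slot -> deduplicated slot
    for e in entries:
        if e not in index_of:
            index_of[e] = len(unique)
            unique.append(e)
        remap.append(index_of[e])
    return unique, remap

# A subgraph that consumes the same tensor twice (e.g. elemwise_mul(a, a)):
inputs = [("a", 0), ("a", 0), ("b", 0)]
unique, remap = dedup_inputs(inputs)
```

Here `unique` ends up as `[("a", 0), ("b", 0)]` and `remap` as `[0, 0, 1]`, so both former consumers of the duplicate are wired to the single surviving input.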
This comment applies everywhere where we set the defaults for the dedup flag.
It seems like this is more a problem with MKLDNN.
I understand they have done that for master branch.
Ideally, they fix it at their end for 1.8 as well.
Failing that, we should go with default = true, except when compiled for MKLDNN and optimize_for() is called by CPU. All other paths would want to optimize away the dupe input.
This amounts to MKLDNN patching their fix.
Given the timelines for MXNet 1.8, this may be too tight anyway.
More importantly, the MKLDNN issues are fixed in master.
I am ok going ahead with this as is
LGTM
@samskalicky I've been working on enabling MKLDNN partitioning on master and discovered that this PR broke the mobilenet_struct_v2 test. I hadn't seen this PR before, so I'm sorry for the late response. I've described where the problem is in the picture above.
Hi @bgawrych, sorry, it looks like I tagged the wrong people; I'll start tagging you for MKLDNN-related issues in the future.
The problem that I ran into in v1.x was that the mkldnn partitioning flow was expecting the input to be duplicated, and was playing around with wiring up the subgraph nodes. I wasn't able to decode the current flow and make it work with deduplicated inputs. Here are two of the places where the code is expecting inputs to be duplicated: Ideally, I would want to refactor the code so that there are no baked-in expectations, and however the subgraph is partitioned, it will work with the subgraph op (i.e.
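A baked-in expectation of the kind described above can be illustrated with a hypothetical sketch (not actual MKLDNN code): a fused op that hard-codes input positions happens to work while inputs are duplicated, then fails once they are deduplicated.

```python
# Hypothetical illustration of why a flow with baked-in input positions
# breaks under deduplication. Inputs are plain floats for simplicity.

def fused_mul_sum(inputs):
    # Assumes the subgraph always receives exactly three inputs, with the
    # duplicated tensor at positions 0 and 1 -- a hard-coded expectation.
    return inputs[0] * inputs[1] + inputs[2]

# With the input duplicated (a, a, b), the hard-coded indexing works:
result = fused_mul_sum([2.0, 2.0, 5.0])  # 2.0 * 2.0 + 5.0 = 9.0

# After deduplication the same subgraph receives only (a, b), and the
# indexing reads past the end of the input list:
try:
    fused_mul_sum([2.0, 5.0])
except IndexError:
    pass  # the baked-in position for the third input no longer exists
```

This is why the PR keeps deduplication off by default: flows that encode positions like this keep working unchanged until they are refactored.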
In the meantime, I'll work on porting the
* initial commit
* update build_subgraph
* added test
* calling test

Co-authored-by: Ubuntu <[email protected]>
Description
This PR started out as a backport of #16131 to v1.x, but then ran into some issues with MKLDNN subgraphing. I wasn't able to enhance the MKLDNN subgraphing to work with deduplicated inputs/outputs since there were too many hard-coded magic numbers to decode. Instead, I added a flag `dedup_subgraph` to enable deduplicating subgraph inputs/outputs for our needs, but it is disabled by default to keep the current MKLDNN/TensorRT and any other partitioning flows working. So now we set the `"dedup_subgraph"` attribute on the `Graph` and check for it in build_subgraph.cc. We set this attribute on the `Graph` in `MXOptimizeForBackend`, and it is set by users as an argument to `optimize_for`:
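The output side of the fix follows the same pattern as the inputs. Below is a hypothetical Python sketch (not MXNet code; output names and the `dedup_outputs` helper are illustrative) of collapsing duplicate subgraph outputs and rewiring external consumers to the surviving output index:

```python
# Hypothetical sketch of subgraph-output deduplication. Outputs are
# modeled as plain strings instead of nnvm::NodeEntry.

def dedup_outputs(outputs):
    """Return the unique outputs plus a map from each old output index
    to its new index, used to rewire external consumers."""
    unique = []
    remap = []   # old output index -> new output index
    seen = {}
    for out in outputs:
        if out not in seen:
            seen[out] = len(unique)
            unique.append(out)
        remap.append(seen[out])
    return unique, remap

# A subgraph exposing the same tensor twice:
outputs = ["conv0_out", "conv0_out", "relu1_out"]
unique, remap = dedup_outputs(outputs)

# A consumer that previously read old output 1 now reads new output 0:
consumer_slot = remap[1]
```

In the actual flow, per the description above, users opt in from the Python frontend via the `optimize_for` call (something like `sym.optimize_for(backend, ..., dedup_subgraph=True)`; the exact argument spelling here is an assumption based on the flag name in this PR), which sets the `"dedup_subgraph"` attribute that build_subgraph.cc checks.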